Business Analytics

Data Wrangling

Ayush Patel and Jayati Sharma

10 February, 2024

Before we begin

Please install and load the following packages

library(dplyr)
library(tidyverse)
library(openintro)



Access lecture slide from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am a RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at Oxford Poverty and Human development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn how to import data
  • Learn data wrangling functions like select, arrange, mutate, filter, group by
  • Learn how to pipe functions

First Things First - Packages

  • Package is a collection of of functions
  • Install a package only once but load it for every session

Source : dplyr

Data Wrangling - Loading the dataset

For this lecture, we will be using the census dataset from openintro package

census_data <- openintro::census

Taking a peek at the data

head(census_data)
# A tibble: 6 × 8
  census_year state_fips_code total_family_income   age sex    race_general   
        <int> <fct>                         <int> <int> <fct>  <fct>          
1        2000 Florida                       14550    44 Male   Two major races
2        2000 Florida                       22800    20 Female White          
3        2000 Florida                           0    20 Male   Black          
4        2000 Florida                       23000     6 Female White          
5        2000 Florida                       48000    55 Male   White          
6        2000 Florida                       74000    43 Female White          
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>

Alternatively, you can look at the bottom 6 rows

tail(census_data)
# A tibble: 6 × 8
  census_year state_fips_code total_family_income   age sex    race_general
        <int> <fct>                         <int> <int> <fct>  <fct>       
1        2000 Virginia                      45100     5 Male   White       
2        2000 Virginia                       8500    68 Female White       
3        2000 Virginia                      55400    12 Male   White       
4        2000 Virginia                      15000    60 Male   Black       
5        2000 Virginia                         NA    60 Male   Black       
6        2000 Virginia                      39370    47 Male   White       
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>

Data Wrangling - Select

Select only year, state, family income, age, sex and personal income

select(census_data, 
       census_year, state_fips_code,
       total_family_income, age, sex,
       total_personal_income)
# A tibble: 500 × 6
   census_year state_fips_code total_family_income   age sex   
         <int> <fct>                         <int> <int> <fct> 
 1        2000 Florida                       14550    44 Male  
 2        2000 Florida                       22800    20 Female
 3        2000 Florida                           0    20 Male  
 4        2000 Florida                       23000     6 Female
 5        2000 Florida                       48000    55 Male  
 6        2000 Florida                       74000    43 Female
 7        2000 Florida                       23000    60 Female
 8        2000 Florida                       74000    47 Female
 9        2000 Florida                       60000    54 Female
10        2000 Florida                       14600    58 Female
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>

Data Wrangling - Select

Alternatively, we could choose what we do not want

select(census_data, 
       -race_general, -marital_status)
# A tibble: 500 × 6
   census_year state_fips_code total_family_income   age sex   
         <int> <fct>                         <int> <int> <fct> 
 1        2000 Florida                       14550    44 Male  
 2        2000 Florida                       22800    20 Female
 3        2000 Florida                           0    20 Male  
 4        2000 Florida                       23000     6 Female
 5        2000 Florida                       48000    55 Male  
 6        2000 Florida                       74000    43 Female
 7        2000 Florida                       23000    60 Female
 8        2000 Florida                       74000    47 Female
 9        2000 Florida                       60000    54 Female
10        2000 Florida                       14600    58 Female
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>

Also, data types can be selected

select(census_data, 
       where(is.factor))
# A tibble: 500 × 4
   state_fips_code sex    race_general    marital_status        
   <fct>           <fct>  <fct>           <fct>                 
 1 Florida         Male   Two major races Married/spouse present
 2 Florida         Female White           Never married/single  
 3 Florida         Male   Black           Never married/single  
 4 Florida         Female White           Never married/single  
 5 Florida         Male   White           Married/spouse present
 6 Florida         Female White           Married/spouse present
 7 Florida         Female White           Married/spouse present
 8 Florida         Female White           Married/spouse present
 9 Florida         Female Black           Married/spouse present
10 Florida         Female White           Widowed               
# ℹ 490 more rows

But wait…

  • Note that the changes are not saved
  • They can be saved either to existing dataframe by overwriting it
  • Or as a new dataframe
census_data_selected <- select(census_data,
                  -race_general,-marital_status)

Data Wrangling - Arrange

Arranging the data with states in the alphabetical order

arrange(census_data, 
        state_fips_code)
# A tibble: 500 × 8
   census_year state_fips_code total_family_income   age sex    race_general
         <int> <fct>                         <int> <int> <fct>  <fct>       
 1        2000 Alabama                        5500    73 Female Black       
 2        2000 Alabama                       63820    40 Male   White       
 3        2000 Alabama                       11200    60 Female Black       
 4        2000 Alabama                       34500    43 Female Other       
 5        2000 Alabama                       33600     7 Male   White       
 6        2000 Arizona                       32500     9 Female White       
 7        2000 Arizona                       46800    53 Female White       
 8        2000 Arizona                       30000    60 Female White       
 9        2000 Arizona                           0    67 Female White       
10        2000 Arizona                       51000    27 Male   White       
# ℹ 490 more rows
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>

Can you guess how would you arrange in the descending order?

arrange(census_data, 
        -state_fips_code)
# A tibble: 500 × 8
   census_year state_fips_code total_family_income   age sex    race_general   
         <int> <fct>                         <int> <int> <fct>  <fct>          
 1        2000 Florida                       14550    44 Male   Two major races
 2        2000 Florida                       22800    20 Female White          
 3        2000 Florida                           0    20 Male   Black          
 4        2000 Florida                       23000     6 Female White          
 5        2000 Florida                       48000    55 Male   White          
 6        2000 Florida                       74000    43 Female White          
 7        2000 Florida                       23000    60 Female White          
 8        2000 Florida                       74000    47 Female White          
 9        2000 Florida                       60000    54 Female Black          
10        2000 Florida                       14600    58 Female White          
# ℹ 490 more rows
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>

Do it Yourself - 1

  • Load the dataset smoking from the openintro package.
  • Select only the gender, age and smoke variables
  • Select only the numerical variables and arrange the age variable in the descending order
  • Select all variables excep the ethnicity and nationaly variables

Data Wrangling - Rename

Making the variables names more cleaner

rename(census_data, 
       state = state_fips_code)
# A tibble: 500 × 8
   census_year state total_family_income   age sex   race_general marital_status
         <int> <fct>               <int> <int> <fct> <fct>        <fct>         
 1        2000 Flor…               14550    44 Male  Two major r… Married/spous…
 2        2000 Flor…               22800    20 Fema… White        Never married…
 3        2000 Flor…                   0    20 Male  Black        Never married…
 4        2000 Flor…               23000     6 Fema… White        Never married…
 5        2000 Flor…               48000    55 Male  White        Married/spous…
 6        2000 Flor…               74000    43 Fema… White        Married/spous…
 7        2000 Flor…               23000    60 Fema… White        Married/spous…
 8        2000 Flor…               74000    47 Fema… White        Married/spous…
 9        2000 Flor…               60000    54 Fema… Black        Married/spous…
10        2000 Flor…               14600    58 Fema… White        Widowed       
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>

Data Wrangling - Mutate

  • Creating a new variable from existing ones
  • Finding total income
census_data <- mutate(census_data, 
                      total_income = total_family_income + total_personal_income)

Do it Yourself - 2

  • With the smoking dataset, rename the highest_qualification variable as education
  • Create a new variable to calculate the total number of cigarettes in a week (adding weekends and weekdays)

Data Wrangling - Filter

Filtering ages more than 40

filter(census_data, age >= 40)
# A tibble: 212 × 9
   census_year state_fips_code total_family_income   age sex    race_general   
         <int> <fct>                         <int> <int> <fct>  <fct>          
 1        2000 Florida                       14550    44 Male   Two major races
 2        2000 Florida                       48000    55 Male   White          
 3        2000 Florida                       74000    43 Female White          
 4        2000 Florida                       23000    60 Female White          
 5        2000 Florida                       74000    47 Female White          
 6        2000 Florida                       60000    54 Female Black          
 7        2000 Florida                       14600    58 Female White          
 8        2000 Florida                       37000    51 Female White          
 9        2000 Florida                       32000    62 Female White          
10        2000 Florida                      100100    44 Female White          
# ℹ 202 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>

Filtering only females

filter(census_data, sex == "Female")
# A tibble: 232 × 9
   census_year state_fips_code total_family_income   age sex    race_general
         <int> <fct>                         <int> <int> <fct>  <fct>       
 1        2000 Florida                       22800    20 Female White       
 2        2000 Florida                       23000     6 Female White       
 3        2000 Florida                       74000    43 Female White       
 4        2000 Florida                       23000    60 Female White       
 5        2000 Florida                       74000    47 Female White       
 6        2000 Florida                       60000    54 Female Black       
 7        2000 Florida                       14600    58 Female White       
 8        2000 Florida                           0    33 Female White       
 9        2000 Florida                       37000    51 Female White       
10        2000 Florida                       32000    62 Female White       
# ℹ 222 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>

Filtering cases where age is above 40 AND sex is female

filter(census_data, age >= 40 & sex == "Female")
# A tibble: 100 × 9
   census_year state_fips_code total_family_income   age sex    race_general
         <int> <fct>                         <int> <int> <fct>  <fct>       
 1        2000 Florida                       74000    43 Female White       
 2        2000 Florida                       23000    60 Female White       
 3        2000 Florida                       74000    47 Female White       
 4        2000 Florida                       60000    54 Female Black       
 5        2000 Florida                       14600    58 Female White       
 6        2000 Florida                       37000    51 Female White       
 7        2000 Florida                       32000    62 Female White       
 8        2000 Florida                      100100    44 Female White       
 9        2000 Florida                       43950    69 Female White       
10        2000 Florida                       31600    80 Female White       
# ℹ 90 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>

Data Wrangling - Filter

Filtering cases where age is above 40 OR sex is female

filter(census_data, age >= 40 | sex == "Female")
# A tibble: 344 × 9
   census_year state_fips_code total_family_income   age sex    race_general   
         <int> <fct>                         <int> <int> <fct>  <fct>          
 1        2000 Florida                       14550    44 Male   Two major races
 2        2000 Florida                       22800    20 Female White          
 3        2000 Florida                       23000     6 Female White          
 4        2000 Florida                       48000    55 Male   White          
 5        2000 Florida                       74000    43 Female White          
 6        2000 Florida                       23000    60 Female White          
 7        2000 Florida                       74000    47 Female White          
 8        2000 Florida                       60000    54 Female Black          
 9        2000 Florida                       14600    58 Female White          
10        2000 Florida                           0    33 Female White          
# ℹ 334 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>

Filtering cases where age is NOT a female

filter(census_data, sex != "Female")
# A tibble: 268 × 9
   census_year state_fips_code total_family_income   age sex   race_general   
         <int> <fct>                         <int> <int> <fct> <fct>          
 1        2000 Florida                       14550    44 Male  Two major races
 2        2000 Florida                           0    20 Male  Black          
 3        2000 Florida                       48000    55 Male  White          
 4        2000 Florida                      113000     8 Male  White          
 5        2000 Florida                       76900    25 Male  White          
 6        2000 Florida                       48000     1 Male  White          
 7        2000 Florida                       57200    31 Male  White          
 8        2000 Florida                       49000    31 Male  White          
 9        2000 Florida                       50090     2 Male  White          
10        2000 Florida                       64800    47 Male  White          
# ℹ 258 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>

Do it Yourself - 3

  • In the smoking dataset, filter males
  • From the people who do not smoke, how many are 35 years old?
  • Filter cases where the respondents is neither Divorced nor English
  • Filter only cases where age is below 20 and if the person smokes
  • How many people who report that they smoke do not have any qualifications?
  • Filter cases where the respondents is either from London or Wales

Data Wrangling - Many Functions

  • You had to select the state and age variables, rename them, arrange them and filter some states
  • This is what the code would look like
census_data_selected <- select(census_data, state_fips_code, age)
census_data_renamed <- rename(census_data_selected, state = state_fips_code)
census_data_arranged <- arrange(census_data_renamed, state)
census_data_filtered <- filter(census_data_arranged, age >= 40)
  • Tiring, right?

Data Wrangling - Many Functions

- Fortunately, there is good news
- Introducing (drumrolls) pipe
- The pipe operator is written as %>% or |>
- Relies on a step-wise logic
- Useful in making the code more readable
- Eliminates the possibilities of making mistakes in code writing

Data Wrangling - Pipe

Original Code


census_data_selected <- select(
  census_data,
  state_fips_code,age
  )

census_data_renamed <- rename(
  census_data_selected,
  state = state_fips_code
  )

census_data_arranged <- arrange(
  census_data_renamed, 
  state
  )

census_data_filtered <- filter(
  census_data_arranged,
  age >= 40
  )

With Pipe


census_data |>
  select(state_fips_code, age) |>
  rename(state = state_fips_code) |>
  arrange(census_data_renamed, state) |>
  filter(census_data_arranged, age >= 40)

Data Wrangling - Group by and Summarise

Finding the total personal income by sex

census_data |>
  group_by(sex) |>
  summarise(total_income_sex = sum(total_personal_income, na.rm =T)
            )
# A tibble: 2 × 2
  sex    total_income_sex
  <fct>             <int>
1 Female          3278972
2 Male            8121064

Can you similarly calculate the total personal income by state?

census_data |>
  group_by(state_fips_code) |>
  summarise(total_income_state = sum(total_personal_income, na.rm = T)
            )

Do it Yourself - 4

  • In the smoking dataset, select only the highest_qualification and smoke_weekdays variables, rename the variables appropriately and calculate the total number of cigarettes smoke by people in each qualification
  • Find the number of people who smoke below the age of thirty and arrange the results in decreasing order of age
  • Select the income and smoke_weekend variables and calculate the total number of cigarettes smoke in each income category
  • What is the mean age of males and females?

Data Wrangling - case_when

case_when() is used to create a new variable using conditional logic

We want to convert income into income categories

census_data |>
  mutate(
    income_category = case_when(
      total_family_income >= 0 &
        total_family_income < 40000 ~ "Low Income",
      total_family_income >= 40000 &
        total_family_income < 100000 ~ "Middle Income",
      total_family_income >= 100000 ~ "High Income")
    )
# A tibble: 500 × 10
   census_year state_fips_code total_family_income   age sex    race_general   
         <int> <fct>                         <int> <int> <fct>  <fct>          
 1        2000 Florida                       14550    44 Male   Two major races
 2        2000 Florida                       22800    20 Female White          
 3        2000 Florida                           0    20 Male   Black          
 4        2000 Florida                       23000     6 Female White          
 5        2000 Florida                       48000    55 Male   White          
 6        2000 Florida                       74000    43 Female White          
 7        2000 Florida                       23000    60 Female White          
 8        2000 Florida                       74000    47 Female White          
 9        2000 Florida                       60000    54 Female Black          
10        2000 Florida                       14600    58 Female White          
# ℹ 490 more rows
# ℹ 4 more variables: marital_status <fct>, total_personal_income <int>,
#   total_income <int>, income_category <chr>

Do it Yourself - 5

  • Create age categories for the smoking dataset like 15-25, 26-40, 40-59 and 59+
  • Convert the gender variable into a numeric one, with female as 1 and male as 0
  • Create smoking categories as ‘High’ (for more than 30), ‘Medium’(11-30) and ‘Low’ (0-10) with the smoke_weekends variable

Thank you :)